Conversation
Greptile Summary
This PR fixes a bug where empty `user_projects`/`personas` lists were treated as `None` (no-op) instead of clearing the corresponding fields. Key changes:
Confidence Score: 5/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["update_single(doc_id, user_fields)"] --> B{user_fields is None?}
    B -- Yes --> C["project_ids = None\npersona_ids = None\n(no-op for both fields)"]
    B -- No --> D{user_fields.user_projects\nis None?}
    D -- "Yes (None)" --> E["project_ids = None\n(no-op)"]
    D -- "No ([] or [1,2,...])" --> F["project_ids = set(user_projects)\n(e.g. set() clears the field)"]
    E --> G{user_fields.personas\nis None?}
    F --> G
    G -- "Yes (None)" --> H["persona_ids = None\n(no-op)"]
    G -- "No ([] or [1,2,...])" --> I["persona_ids = set(personas)\n(e.g. set() clears the field)"]
    H --> J["MetadataUpdateRequest\n(None fields are skipped;\nempty sets clear the field)"]
    I --> J
    C --> J
    J --> K["OpenSearch / Vespa\nupdate chunks"]
```
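The None-vs-empty-list semantics in the flowchart above can be sketched in Python. This is a minimal illustration only; the function and field names are simplified assumptions, not the actual Onyx types:

```python
# Minimal sketch of the flowchart's semantics: None means "field omitted"
# (no-op), while an empty list means "clear the field" (empty set).
# Names here are illustrative, not the real Onyx signatures.
from typing import Optional


def resolve_update_fields(
    user_projects: Optional[list[int]],
    personas: Optional[list[int]],
) -> dict[str, Optional[set[int]]]:
    # None -> no-op; [] -> set() (clears the field); [1, 2] -> {1, 2}
    project_ids = set(user_projects) if user_projects is not None else None
    persona_ids = set(personas) if personas is not None else None
    return {"project_ids": project_ids, "persona_ids": persona_ids}


resolve_update_fields(None, [1, 2])  # projects untouched, personas set to {1, 2}
resolve_update_fields([], None)      # projects cleared (empty set), personas untouched
```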
1 issue found across 4 files
Confidence score: 4/5
- There's a moderate risk of test failures: module-scoped fixtures in `backend/tests/external_dependency_unit/document_index/test_document_index_old.py` depend on the function-scoped `tenant_context` fixture, which can trigger pytest `ScopeMismatch` errors.
- This is limited to the test suite and doesn't imply a runtime regression, so the overall merge risk remains low.
- Pay close attention to `backend/tests/external_dependency_unit/document_index/test_document_index_old.py`: fix fixture scoping to avoid `ScopeMismatch` errors.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/tests/external_dependency_unit/document_index/test_document_index_old.py">
<violation number="1" location="backend/tests/external_dependency_unit/document_index/test_document_index_old.py:60">
P2: Module-scoped fixtures cannot depend on a function-scoped fixture. `tenant_context` should be module-scoped (or removed from those fixtures) to avoid pytest `ScopeMismatch` errors.</violation>
</file>
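The violation above follows from pytest's rule that a fixture may only depend on fixtures of the same or broader scope. A sketch of the suggested fix, with illustrative fixture bodies (the real `tenant_context` setup is not shown in this thread):

```python
# Sketch of the scope fix: widen tenant_context to module scope so the
# module-scoped fixtures that depend on it no longer raise ScopeMismatch.
# The fixture bodies are stand-ins, not the real test setup.
import pytest


@pytest.fixture(scope="module")  # was scope="function", causing ScopeMismatch
def tenant_context():
    yield "tenant-abc"  # placeholder for real tenant setup/teardown


@pytest.fixture(scope="module")
def document_indices(tenant_context):  # OK: module-scoped -> module-scoped
    yield {"tenant": tenant_context}
```

The alternative the reviewer mentions is to drop `tenant_context` from the module-scoped fixtures entirely, if they can establish tenant state themselves.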
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
🖼️ Visual Regression Report
2 issues found across 6 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/scripts/restart_containers.sh">
<violation number="1" location="backend/scripts/restart_containers.sh:32">
P2: The new `--keep-opensearch-data` flag is parsed but never removed from positional args, so passing the flag without volumes (or before them) turns it into the Vespa volume value and breaks the docker run command. Parse args into a filtered list or shift after handling the flag before assigning volumes.</violation>
</file>
<file name="backend/onyx/document_index/vespa/indexing_utils.py">
<violation number="1" location="backend/onyx/document_index/vespa/indexing_utils.py:282">
P1: Avoid logging raw `e.response.text` for non-retryable HTTP errors; it can expose sensitive response content in logs. Log only safe metadata such as status code and document id.
(Based on your team's feedback about avoiding raw exception/URL token leakage in logs.) [FEEDBACK_USED]</violation>
</file>
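For the logging violation above, a hedged sketch of what "log only safe metadata" could look like; the function and logger names are assumptions, not the actual `indexing_utils.py` code:

```python
# Sketch of logging only safe metadata on a non-retryable HTTP error.
# The response body may contain sensitive content, so it is deliberately
# omitted; only the status code and document id are logged.
import logging

logger = logging.getLogger(__name__)


def log_index_failure(doc_id: str, status_code: int) -> None:
    # Illustrative name; the real call site wraps an HTTPError handler.
    logger.error("Failed to index document %s (HTTP %d)", doc_id, status_code)
```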
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Preview Deployment
```python
@pytest.fixture(scope="function")
def index_batch_params() -> Generator[IndexBatchParams, None, None]:
    yield IndexBatchParams(
        doc_id_to_previous_chunk_cnt={"test_doc": 0},
        doc_id_to_new_chunk_cnt={"test_doc": 5},
        tenant_id=get_current_tenant_id(),
        large_chunks_enabled=False,
    )
```
Hardcoded doc_id_to_previous_chunk_cnt will break future tests sharing module-scoped backends
doc_id_to_previous_chunk_cnt={"test_doc": 0} is hardcoded to 0, which is only correct on the very first index call. The document_indices fixture is scope="module", meaning the same OpenSearch and Vespa backends persist across all test functions in this module. When a second test function uses this fixture and calls document_index.index(...), the backend already has 5 chunks for "test_doc" from the previous test run, but the batch params still claim 0 prior chunks exist. This can lead to orphaned/duplicate chunks that make subsequent assertions incorrect.
The PR description explicitly states these fixtures should be reused for additional tests, so this will become a real problem quickly. Consider making the previous chunk count dynamic or resetting it to the correct value based on backend state, or using a per-test unique doc_id.
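The "per-test unique doc_id" option suggested above could look like the following sketch; the helper name and dict shape are illustrative, mirroring the fixture fields rather than the real `IndexBatchParams` type:

```python
# Illustrative sketch of the reviewer's per-test unique doc_id suggestion:
# each call gets its own document id, so a previous-chunk count of 0 stays
# correct even when module-scoped backends persist across tests.
import uuid


def make_index_batch_params() -> dict:
    doc_id = f"test_doc_{uuid.uuid4().hex[:8]}"  # unique per invocation
    return {
        "doc_id_to_previous_chunk_cnt": {doc_id: 0},  # 0 is now always true
        "doc_id_to_new_chunk_cnt": {doc_id: 5},
    }
```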
Prompt To Fix With AI
This is a comment left during a code review.
Path: backend/tests/external_dependency_unit/document_index/test_document_index_old.py
Line: 273-280
Comment: see the review comment above.
How can I resolve this? If you propose a fix, please make it concise.
Good feedback. I'll add a comment in the code for future test authors about this, but I've spent so long getting this test working that I don't care about addressing it in this PR.
Or I can just clean up the index between test cases. Either way, I'll do this later.
```python
        project_id=1,
        persona_id=2,
    )
    time.sleep(1)
```
Fixed-duration time.sleep makes tests fragile in slow CI environments
Using time.sleep(1) for Vespa and OpenSearch indexing propagation is a known source of test flakiness—on a loaded CI worker 1 second may not be enough for the write to be queryable. The vespa_document_index fixture already demonstrates a robust polling pattern (lines 145–154) that retries for up to 60 seconds. Applying a similar retry-until-consistent approach here would make these assertions more reliable:
```python
import tenacity

@tenacity.retry(
    stop=tenacity.stop_after_delay(10),
    wait=tenacity.wait_fixed(0.5),
    reraise=True,
)
def assert_chunk_count(document_index, chunk_request, filters, expected_count):
    chunks = document_index.id_based_retrieval(chunk_requests=[chunk_request], filters=filters)
    assert len(chunks) == expected_count
```

This pattern (or an equivalent simple polling loop) applies to the `time.sleep(1)` at line 342 as well.
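The "equivalent simple polling loop" mentioned in the comment could be sketched as follows, without the tenacity dependency; the helper name and the usage line are illustrative assumptions:

```python
# Sketch of a plain polling loop: retry a predicate until it passes or the
# deadline expires, instead of a fixed-duration sleep.
import time


def wait_until(predicate, timeout: float = 10.0, interval: float = 0.5) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check at the deadline


# Usage sketch (document_index and the retrieval call come from the tests):
# assert wait_until(lambda: len(document_index.id_based_retrieval(...)) == 5)
```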
Prompt To Fix With AI
This is a comment left during a code review.
Path: backend/tests/external_dependency_unit/document_index/test_document_index_old.py
Line: 316
Comment: see the review comment above.
How can I resolve this? If you propose a fix, please make it concise.
I know, man, I'm tired. Just read the comment in the code above about this.
Description
Currently empty projects and personas are treated as None, meaning no update required, when really an empty list should mean clear these respective fields from the chunk.
This is a followup to #8680 (comment).
How Has This Been Tested?
I went through the effort of adding external dependency unit tests for this function of the old document index interface. Hopefully these fixtures can be used to easily add additional tests when needed. This took me more time and effort than I wanted, please clap.
Additional Options
Summary by cubic
Fixes updates so that sending empty `user_projects`/`personas` clears those fields in both OpenSearch and Vespa. Project/persona filters no longer match cleared chunks after clearing.

- In `update_single` for both backends, treat empty lists as "clear": send empty sets to clear fields, and use `None` when fields are omitted (no-op).

Written for commit 1388cdd. Summary will update on new commits.